# Required Libraries
library(car)
Load the 04NYCRestaurants.txt dataset into your workspace. This dataset contains survey results from customers of 168 different Italian restaurants in the New York City area. The data are in the form of the average of customer views on various attributes (food, decor, and service) scored on a scale from 1 to 30, along with the average price of dinner. There is also a categorical variable for the location of the restaurant.
rest <- read.table('04NYCRestaurants.txt', header = TRUE, sep = " ", quote = "\"", stringsAsFactors = FALSE)
rest$Location <- as.factor(rest$Location)
plot(rest[, 2:5], col = rest$Location)
Looking at the scatterplot matrix, many of the variables appear to be correlated with one another.
Write out the regression equation. Price = -21.956 + 1.538(Food) + 1.910(Decor) - 0.003(Service) - 2.068(LocationWest), where LocationWest is 1 for a restaurant in the West and 0 for a restaurant in the East.
Interpret the meaning of each of the 5 coefficients in the context of the problem. Intercept coefficient - for a restaurant in the East with food, decor and service ratings of 0, the model predicts an average price of -21.956. This doesn't make sense in the context of the problem, since a price cannot be negative and a rating of 0 is outside the 1 to 30 scale; the intercept is just the fixed point where the fitted surface is anchored.
Food coefficient - holding all else constant, a 1-point increase in the food rating is associated with an average price increase of 1.54
Decor coefficient - holding all else constant, a 1-point increase in the decor rating is associated with an average price increase of 1.91
Service coefficient - holding all else constant, a 1-point increase in the service rating is associated with an average price decrease of 0.003, which is effectively zero
LocationWest coefficient - holding all else constant, a restaurant in the West is on average 2.07 cheaper than a restaurant in the East
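As a quick check on these interpretations, the fitted equation can be evaluated by hand. The sketch below copies the coefficients from the model summary and compares two hypothetical restaurants that differ only in location. The ratings (20, 18, 19) and the helper name `predict_price` are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the full model's summary() output.
b <- c(intercept = -21.955750, food = 1.538120, decor = 1.910087,
       service = -0.002727, west = -2.068050)

# Hypothetical helper: predicted price for the given ratings.
# west = 1 for a restaurant in the West, 0 for one in the East.
predict_price <- function(food, decor, service, west) {
  unname(b["intercept"] + b["food"] * food + b["decor"] * decor +
           b["service"] * service + b["west"] * west)
}

# Made-up ratings, identical for both restaurants.
east <- predict_price(food = 20, decor = 18, service = 19, west = 0)
west <- predict_price(food = 20, decor = 18, service = 19, west = 1)

# The gap between the two predictions is the LocationWest coefficient.
east - west
```

Because every other input is held fixed, the difference between the two predictions equals the magnitude of the LocationWest coefficient, which is what "holding all else constant" means in practice.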
Are the coefficients significant? How can you tell?
Based on the multiple linear regression summary, the Intercept, Food, Decor, and LocationWest coefficients are statistically significant: each t-test has a p-value below 0.05. The Service coefficient is not significant (p-value 0.9945).
The overall model is significant: the F-test has a p-value below 0.05.
The RSE is 5.738, which is the estimated standard deviation of the residual errors.
The adjusted coefficient of determination is 0.6187, meaning that roughly 62% of the variation in price is explained by the variables in the model, after adjusting for the number of predictors.
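The fit statistics quoted here are linked by standard formulas. As a rough sketch, the adjusted R-squared and the overall F-statistic can be recovered from the multiple R-squared (0.6279), the sample size (168) and the number of slope coefficients (4) that summary() reports for this model. The inputs are rounded, so small discrepancies in the last digit are expected.

```r
n  <- 168      # number of restaurants
k  <- 4        # slope coefficients: Food, Decor, Service, LocationWest
r2 <- 0.6279   # multiple R-squared as reported by summary()
df_resid <- n - k - 1   # residual degrees of freedom (163)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variation per predictor relative to
# unexplained variation per residual degree of freedom.
f_stat <- (r2 / k) / ((1 - r2) / df_resid)
p_f    <- pf(f_stat, k, df_resid, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, F = f_stat), 4)
```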
restModFull <- lm(Price ~ . -Restaurant, data = rest)
summary(restModFull)
##
## Call:
## lm(formula = Price ~ . - Restaurant, data = rest)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.0465  -3.8837   0.0373   3.3942  17.7491
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -21.955750   4.857969  -4.520 1.19e-05 ***
## Food           1.538120   0.368951   4.169 4.96e-05 ***
## Decor          1.910087   0.217005   8.802 1.87e-15 ***
## Service       -0.002727   0.396232  -0.007   0.9945
## LocationWest  -2.068050   0.946739  -2.184   0.0304 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.738 on 163 degrees of freedom
## Multiple R-squared: 0.6279, Adjusted R-squared: 0.6187
## F-statistic: 68.76 on 4 and 163 DF, p-value: < 2.2e-16
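To see where the significance columns come from, the t values and p-values can be recomputed from the estimates and standard errors printed above. This is a minimal sketch using the printed (rounded) numbers and the 163 residual degrees of freedom. The 95% confidence intervals are an extra that the original output does not show.

```r
# Estimates and standard errors copied from the summary output above.
est <- c(Intercept = -21.955750, Food = 1.538120, Decor = 1.910087,
         Service = -0.002727, LocationWest = -2.068050)
se  <- c(Intercept = 4.857969, Food = 0.368951, Decor = 0.217005,
         Service = 0.396232, LocationWest = 0.946739)
df_resid <- 163

# t statistic: estimate divided by its standard error.
t_val <- est / se

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_val), df = df_resid)

# 95% confidence intervals (not in the original output).
half_width <- qt(0.975, df = df_resid) * se
ci <- cbind(lower = est - half_width, upper = est + half_width)

round(cbind(t = t_val, p = p_val, ci), 4)
```

A coefficient is significant at the 5% level exactly when its 95% interval excludes zero, which is the case for every term except Service.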
plot(restModFull)
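The QQ plot appears to show a violation of the normality assumption. The scale-location plot also appears to show a violation of the constant variance (homoscedasticity) assumption; that plot checks whether the spread of the residuals changes across fitted values, not whether the errors are independent.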
influencePlot(restModFull)
##       StudRes        Hat      CookD
## 56  3.2666518 0.05010858 0.32600253
## 130 2.9463084 0.07181092 0.35815562
## 168 0.4012884 0.21011533 0.09279813
A few restaurants are of concern: 56 and 130 have large studentized residuals (about 3.3 and 2.9) but comparatively modest leverage, while restaurant 168 has high leverage (hat value 0.21) but a small residual.
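One way to put the hat values in context is to compare them with the average leverage p/n, where p is the number of estimated coefficients (5 here) and n is the sample size (168). The cutoffs of two and three times the average used below are common rules of thumb, assumed here for illustration; they are not something the original analysis specifies.

```r
n <- 168   # restaurants
p <- 5     # estimated coefficients, including the intercept

# Hat values copied from the influencePlot() output above.
hat_vals <- c("56" = 0.05010858, "130" = 0.07181092, "168" = 0.21011533)

avg_hat <- p / n   # hat values always average to p / n

leverage <- data.frame(
  hat      = hat_vals,
  ratio    = hat_vals / avg_hat,      # multiples of the average leverage
  above_2x = hat_vals > 2 * avg_hat,  # rule-of-thumb cutoff (assumption)
  above_3x = hat_vals > 3 * avg_hat   # stricter cutoff (assumption)
)
leverage
```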
vif(restModFull)
##     Food    Decor  Service Location
## 2.714261 1.744851 3.558735 1.064985
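The VIF for service is the highest (3.56), indicating that service is the predictor most strongly correlated with the others. This collinearity inflates its standard error and is consistent with it being the least significant coefficient. None of the VIFs exceeds the common rule-of-thumb thresholds of 5 or 10, so the multicollinearity is moderate rather than severe.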
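A VIF can be translated into something more tangible. Since VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the other predictors, the reported values can be inverted to recover that R-squared. The square root of the VIF gives the factor by which the coefficient's standard error is inflated relative to uncorrelated predictors. A small sketch using the values printed above:

```r
# VIFs copied from the vif(restModFull) output above.
vif_full <- c(Food = 2.714261, Decor = 1.744851,
              Service = 3.558735, Location = 1.064985)

# Share of each predictor's variance explained by the other predictors.
r2_j <- 1 - 1 / vif_full

# Factor by which collinearity inflates each standard error.
se_inflation <- sqrt(vif_full)

round(rbind(r2_j, se_inflation), 3)
```

Read this way, roughly 72% of the variation in the service rating is explained by the other predictors, which leaves little independent information for estimating its coefficient.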
avPlots(restModFull)
The added variable plots show that Food and Decor are the most powerful predictors in the model (as expected based on coefficient p-values). While location has some impact it is relatively small. The service variable appears to be the least helpful in predicting price.
restModSvc <- lm(Price ~ Service, data = rest)
summary(restModSvc)
##
## Call:
## lm(formula = Price ~ Service, data = rest)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -17.6646  -4.7540  -0.2093   4.3368  26.2460
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.9778     5.1093  -2.344   0.0202 *
## Service       2.8184     0.2618  10.764   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.153 on 166 degrees of freedom
## Multiple R-squared: 0.4111, Adjusted R-squared: 0.4075
## F-statistic: 115.9 on 1 and 166 DF, p-value: < 2.2e-16
plot(Price ~ Service, data = rest)
abline(restModSvc, lty = 2)
A simple linear regression predicting price solely from the service rating is statistically significant (p-value < 0.05). According to this model, a 1-point increase in the service rating is associated with an average price increase of 2.82. However, the full model shows that service adds essentially nothing once food and decor are accounted for, which suggests that the service rating is strongly correlated with those ratings and is acting as a proxy for them here. A better model would therefore predict price from food, decor and location.
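The contrast between a large simple-regression slope and a near-zero multiple-regression coefficient can be reproduced on simulated data. The sketch below does not use the restaurant data: all variables are synthetic, and the effect sizes, noise levels and seed are arbitrary assumptions chosen only to illustrate the mechanism.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 168

# Synthetic ratings: service closely tracks food, while price depends
# on food alone. These are made-up values, not the survey data.
food    <- rnorm(n, mean = 20, sd = 2)
service <- food + rnorm(n, sd = 1)
price   <- 2 * food + rnorm(n, sd = 3)

# Service alone picks up the effect of the omitted food rating.
simple <- unname(coef(lm(price ~ service))["service"])

# Once food is in the model, service has little left to explain.
adjusted <- unname(coef(lm(price ~ food + service))["service"])

c(simple = simple, adjusted = adjusted)
```

In this setup the service slope is clearly positive on its own and shrinks toward zero once food is included, mirroring the pattern seen in the restaurant models.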
The model summary() output. This model appears to be better than the full model. All coefficients are now statistically significant. The overall F-test is still significant (p-value less than 0.05). We have a small decrease in the RSE to 5.72 and a slight increase in the adjusted coefficient of determination to 0.6211.
The assumptions of multiple linear regression. As with the full model, the QQ plot appears to show a violation of the normality assumption. In addition, the scale-location plot appears to show a violation of the constant variance (homoscedasticity) assumption.
The influence plot of the model. Restaurants 56 and 130 still have large studentized residuals in this model. Restaurant 117 has relatively high leverage but a small residual.
The variance inflation factors of the coefficients. By removing the service variable the VIF for the remaining coefficients was reduced.
The added variable plots for the model. The added variable plots show that Food and Decor are the most powerful predictors in the model (as expected based on coefficient p-values). While location has some impact it is relatively small.
restModNew <- lm(Price ~ . -Restaurant -Service, data = rest)
summary(restModNew)
##
## Call:
## lm(formula = Price ~ . - Restaurant - Service, data = rest)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.0451  -3.8809   0.0389   3.3918  17.7557
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -21.9599     4.8063  -4.569 9.59e-06 ***
## Food           1.5363     0.2632   5.838 2.76e-08 ***
## Decor          1.9094     0.1900  10.049  < 2e-16 ***
## LocationWest  -2.0670     0.9318  -2.218   0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared: 0.6279, Adjusted R-squared: 0.6211
## F-statistic: 92.24 on 3 and 164 DF, p-value: < 2.2e-16
plot(restModNew)
influencePlot(restModNew)
##       StudRes        Hat     CookD
## 56  3.2282969 0.02245111 0.2378828
## 117 0.4445866 0.18087990 0.1047159
## 130 2.9380865 0.06091881 0.3657472
vif(restModNew)
##     Food    Decor Location
## 1.389515 1.346030 1.038000
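The drop in VIF lines up with the drop in standard errors between the two models. For a model with an intercept, the standard error of a slope is proportional to the residual standard error times the square root of its VIF, so the ratio of standard errors between the full and reduced models should be close to (RSE_full / RSE_new) * sqrt(VIF_full / VIF_new). The sketch below checks this with the printed values; because they are rounded, agreement is only approximate.

```r
# Values copied from the two model summaries and vif() outputs.
vif_full <- c(Food = 2.714261, Decor = 1.744851)
vif_new  <- c(Food = 1.389515, Decor = 1.346030)
se_full  <- c(Food = 0.368951, Decor = 0.217005)
se_new   <- c(Food = 0.2632,   Decor = 0.1900)
rse_full <- 5.738
rse_new  <- 5.72

# Ratio implied by the change in RSE and VIF.
predicted_ratio <- (rse_full / rse_new) * sqrt(vif_full / vif_new)

# Ratio actually observed in the coefficient tables.
observed_ratio <- se_full / se_new

round(rbind(predicted_ratio, observed_ratio), 3)
```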
avPlots(restModNew)
anova(restModNew, restModFull)
## Analysis of Variance Table
##
## Model 1: Price ~ (Restaurant + Food + Decor + Service + Location) - Restaurant -
## Service
## Model 2: Price ~ (Restaurant + Food + Decor + Service + Location) - Restaurant
##   Res.Df    RSS Df Sum of Sq  F Pr(>F)
## 1    164 5366.5
## 2    163 5366.5  1   0.00156  0 0.9945
Given that the p-value (0.9945) is greater than 0.05, we fail to reject the null hypothesis that the service coefficient is zero. The service variable does not significantly improve the fit, so the simpler model that excludes it is preferred.
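The F-statistic in this table is printed as 0 only because of rounding. It can be recomputed from the sum of squares attributed to Service and the residual sum of squares of the full model, and with a single dropped term it should equal the square of the Service t statistic from the full model. A short sketch using the printed values:

```r
# Values copied from the anova() table and the full model summary.
rss_full   <- 5366.5    # residual sum of squares, full model
ss_service <- 0.00156   # extra sum of squares explained by Service
df_full    <- 163       # residual degrees of freedom, full model

# Partial F: extra sum of squares per dropped term, divided by the
# full model's residual mean square.
f_partial <- (ss_service / 1) / (rss_full / df_full)
p_partial <- pf(f_partial, 1, df_full, lower.tail = FALSE)

# With one dropped term, F equals the squared t statistic.
t_service <- -0.002727 / 0.396232

c(F = f_partial, t_squared = t_service^2, p = p_partial)
```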
restModFD <- lm(Price ~ Food + Decor, data = rest)
summary(restModFD)
##
## Call:
## lm(formula = Price ~ Food + Decor, data = rest)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -14.945  -3.766  -0.153   3.701  18.757
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.5002     4.7230  -5.187 6.19e-07 ***
## Food          1.6461     0.2615   6.294 2.68e-09 ***
## Decor         1.8820     0.1919   9.810  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.788 on 165 degrees of freedom
## Multiple R-squared: 0.6167, Adjusted R-squared: 0.6121
## F-statistic: 132.7 on 2 and 165 DF, p-value: < 2.2e-16
plot(restModFD)
By removing the location variable we see a slight increase in RSE to 5.788 and a slight decrease in the adjusted R^2 value to 0.6121. As with the previous two models, the QQ plot appears to show a violation of the normality assumption. In addition, the scale-location plot appears to show a violation of the constant variance (homoscedasticity) assumption.
AIC(restModFull, restModNew, restModFD)
##             df      AIC
## restModFull  6 1070.711
## restModNew   5 1068.711
## restModFD    4 1071.677
Based on the AIC, the model that includes food, decor and location is the best model (with the lowest AIC). The worst model is the one that only includes food and decor.
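For a linear model with Gaussian errors, the AIC can be written in terms of the residual sum of squares, which makes the trade-off between fit and complexity explicit. The sketch below recomputes the three values from numbers printed earlier. The helper name `aic_gaussian` is hypothetical, and the RSS for the food and decor model is backed out from its rounded RSE, so the results match the table only approximately.

```r
n <- 168

# Hypothetical helper: AIC of a Gaussian linear model from its RSS.
# n_par counts the regression coefficients plus the error variance,
# matching the df column of the AIC() output.
aic_gaussian <- function(rss, n_par, n) {
  n * log(2 * pi) + n * log(rss / n) + n + 2 * n_par
}

rss_full <- 5366.5          # from the anova() table
rss_new  <- 5366.5          # same to the printed precision
rss_fd   <- 5.788^2 * 165   # RSE^2 times residual df (rounded input)

aic <- c(restModFull = aic_gaussian(rss_full, 6, n),
         restModNew  = aic_gaussian(rss_new,  5, n),
         restModFD   = aic_gaussian(rss_fd,   4, n))
round(aic, 2)
```

Dropping Service leaves the RSS essentially unchanged while saving one parameter, so the AIC falls by about 2. Dropping Location raises the RSS by enough to outweigh the saved parameter.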
BIC(restModFull, restModNew, restModFD)
##             df      BIC
## restModFull  6 1089.454
## restModNew   5 1084.330
## restModFD    4 1084.173
Based on the BIC, the model that includes just food and decor is the best model (with the lowest BIC), though only slightly better than the model which also includes location. The worst model is the one that includes all variables.
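The reason AIC and BIC rank the models differently comes down to the penalty per parameter: AIC charges 2 per parameter, while BIC charges log(n), which is about 5.12 for n = 168. The gap between the two criteria for any model should therefore be df * (log(n) - 2). A small check against the printed tables:

```r
n  <- 168
df <- c(restModFull = 6, restModNew = 5, restModFD = 4)

# Values copied from the AIC() and BIC() outputs above.
aic <- c(restModFull = 1070.711, restModNew = 1068.711, restModFD = 1071.677)
bic <- c(restModFull = 1089.454, restModNew = 1084.330, restModFD = 1084.173)

# Difference implied by the two penalty terms.
penalty_gap <- df * (log(n) - 2)

round(rbind(observed = bic - aic, implied = penalty_gap), 3)
```

Because BIC penalizes each extra parameter more heavily, the modest improvement in fit from adding Location is enough to win under AIC but not quite enough under BIC.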
The results from parts 4 and 5 were expected. I would ultimately choose the model that includes food, decor and location.